Identification of misdiagnosis by deep neural networks on a histopathologic review of breast cancer lymph node metastases

The frozen section (FS) diagnoses of pathology experts are used in China to determine, during surgery, whether the sentinel lymph nodes of breast cancer patients contain metastases. Direct implementation of a deep neural network (DNN) in clinical practice may be hindered by misdiagnoses of the algorithm, which affect a patient's treatment decision. In this study, we first obtained the prediction results of the commonly used patch-DNN and then present a relative risk classification and regression tree (RRCART) to identify the misdiagnosed whole-slide images (WSIs) and recommend them for review by pathologists. When this framework was applied to 2362 WSIs of breast cancer lymph node metastasis, the mean area under the curve (AUC) on the frozen-section test set reached 0.9851. However, the mean misdiagnosis rate (0.0248) was significantly higher than the pathologists' misdiagnosis rate (p < 0.01). The RRCART assigned more than 80% of the WSIs to a high-accuracy group with an average accuracy of 0.995, for which the difference from the pathologists' performance was not significant (p > 0.01). The remaining low-accuracy group contained most of the misdiagnoses of the DNN models. Our research shows that the misdiagnoses of a deep learning model can be enriched by our method: the low-accuracy WSIs must be selected for pathologists to review, and the high-accuracy ones may be ready for pathologists to give diagnostic reports.


Results
Influence of the sampling ratio and DNN on the classification metrics. Because the sampling ratio and the choice among commonly used DNN architectures may affect the WSI classification metrics, we designed 18 patch-DNN models based on the sampling ratio and patch-DNN method (Fig. 1). The average values of the sensitivity, precision and F1 score of the 18 models were all above 0.92, while the average values of the specificity, accuracy and AUC were above 0.97. Two-way ANOVA showed that the sampling ratio, DNN architecture and their interaction had no significant effect on the classification metrics (Table S1).

Misdiagnoses made by patch-DNN models.
To summarize the statistical characteristics of misdiagnosis among the 18 models, we counted the misdiagnosed WSIs after removing samples that were misdiagnosed by more than one model. The total was 104, including 60 false-positive and 44 false-negative diagnoses. To analyze the reasons for the inconsistency between the diagnoses of these 18 models and the pathological diagnosis of sentinel lymph nodes, the pathology review group reviewed the FSs that were misdiagnosed by the above models. In one FS, the cut edge of a mammary gland was mistaken for a sentinel lymph node, and another model erroneously classified a lymph node without metastasis as a lymph node with metastasis (Fig. 2A). These two WSIs were obtained by taking an incorrect FS during the process of section scanning. In addition, one WSI was difficult to diagnose based only on FSs. In this case, we observed a few suspicious carcinoma cells in the capsule of the lymph node in the FS, while these cells disappeared in the deeper paraffin section. Therefore, we could not obtain a definitive diagnosis on this FS. After these three samples were removed, the models made prediction errors on the remaining 101 WSIs, including 41 false negatives (Fig. 2B) and 60 false positives (Fig. 2C).
Notably, 30 WSIs (28.85%) were predicted incorrectly by more than 10 of the 18 models, indicating that the patch-DNN method consistently misclassified certain samples. It will be clinically necessary to refine the patch-DNN to further improve its diagnostic performance.

We randomly selected 160 FSs from the test set (Fig. 1) and then invited 15 experienced pathologists (attending doctors or above, with 6-33 years of working experience) from CICAMS to review the FSs manually, in order to establish a Poisson distribution model according to the frequency of incorrect diagnoses (Fig. 3). We compared the prediction results of the 18 patch-DNN models with the diagnoses of the 15 pathologists (Fig. 3). The results showed that the patch-DNN models produced significantly more errors on normal and micrometastasis WSIs than the pathologists did (p < 0.01, Table S2). In addition, the five classification evaluation metrics of the models, namely, sensitivity, specificity, precision, accuracy, and F1 score, were significantly different from those of the pathology experts (p < 0.01, Table S3). All of the above results suggest that the diagnostic accuracy of the pathologist consensus was higher than the prediction performance of the DNN models.
High-accuracy predicted WSIs can be separated from low-accuracy WSIs by RRCART. To make patch-DNN models meet the diagnostic requirements for FSs, we divided the prediction results of the patch-DNN models into two categories: high-accuracy WSIs and low-accuracy WSIs (Fig. 1, Supplementary materials).
Based on the above assumption, we labeled each WSI again according to the prediction results of each patch-DNN model and the ground truth (Fig. 1). The number of incorrectly predicted WSIs was far smaller than the number of correctly predicted WSIs, giving imbalanced data with an average ratio of 1:41 (Table S4). We designed a classification algorithm combining the relative risk ratio and a classification and regression tree (Fig. 1, Supplementary materials). Application of the RRCART algorithm on the test set divided all WSIs into a high-accuracy subset and a low-accuracy subset. A significant difference in misdiagnosis was detected between the results of the original 18 models and the expectations of the pathology experts (p < 0.01, Table 1). After enrichment by the RRCART algorithm, however, no significant difference was observed between the expectation of incorrect prediction by the patch-DNN models on the high-accuracy WSIs and that of the pathology experts (p > 0.01). For the low-accuracy WSIs, in contrast, a significant difference was found between the number of recognition errors by the 18 models and the expectation of the pathology experts (p < 0.01). Notably, the high-accuracy set accounted for 83% of all WSIs on average across the 18 models, and its accuracy reached 0.995 (an average of 4.11 misdiagnoses). Our RRCART method significantly reduced the misdiagnosis rate for 16 patch-DNN models (Table 1, p < 0.05, Fisher's exact test). The above results support that RRCART can help the patch-DNN diagnose most lymph node metastases from breast cancer more accurately.

Figure 1. Overview of the Expert-DNN framework presented in this study. See Materials and Methods for complete details. There were two steps in this framework. In step 1, the patch-DNN workflow includes patch and WSI classification. Patches were extracted from tumor and normal regions of WSIs and used as input data for DNN training. Patch coordinates and probabilities generated by the model were then used for heatmap construction. A machine learning model (random forest) was trained on the features extracted from the WSI heatmap to give a WSI-level prediction. Three sampling ratios (1:3, 1:5 and 1:7) were combined with three DNNs (InceptionV3, ResNet101 and Xception), and each experiment was repeated once, yielding 18 patch-DNN models. In step 2, a Poisson distribution for the pathology experts and an RRCART were constructed. For the Poisson distribution, the values of λ for the normal, micrometastasis, and macrometastasis cases were estimated based on the incorrect diagnosis frequency of 15 pathology experts for the 160 lymph node FSs. The patch-DNN prediction result was compared with the ground truth to relabel the WSI as either easy (0) or hard (1), corresponding to the prediction result being the same as the ground truth or not. The easy and hard labeling information and the features extracted from the heatmaps were used as the input of the RRCART. The high-accuracy group, with an incorrect prediction rate statistically indistinguishable from that of the pathology experts, may be ready for pathologists to give diagnostic reports, while the low-accuracy group, with a high frequency of incorrect predictions, must be selected for pathologists to review.

Discussion
In this study, we focused on misclassification by DNN algorithms, a research area that has not been well investigated.
Here, we hypothesized that, for most FSs, the patch-DNN may achieve the same accuracy as experienced pathologists in the histopathologic review of lymph node metastasis from breast cancer. To test our hypothesis, we presented a modified classification and regression tree and the Poisson distribution for classifying the risks of misdiagnoses made by common patch-DNNs. We applied the method to 2362 sentinel lymph node FSs collected in CICAMS. Our experiments confirmed the hypothesis, showing that the safety issues caused by using a DNN directly in pathologic diagnosis can be mitigated and that integrating a DNN into a clinical workflow is feasible.

We established a Poisson distribution model according to the frequency of misdiagnosis by pathology experts on 160 FSs. The Poisson distribution can be used to compare the performance of machines and humans on a dataset of thousands of WSIs. Previous studies used the sensitivity, specificity, etc., to compare the best DNN model with the performance of a small number of pathologists 6,12,23,24, but this approach cannot assess the average performance of different models and cannot compare DNN models and pathologists on datasets that have not been examined by pathology experts. Our results showed that the accuracy of pathology experts in examining FSs was generally higher than that of the 18 models. This result is substantially different from the results in most articles 12,23,24.
The possible reasons for the high degree of homogeneity in the FS diagnosis of sentinel lymph nodes by our pathologists may be as follows: first, the diagnosis of lymph node metastasis is a basic and essential skill for cancer pathologists, on which they have been heavily trained; second, the 5 senior pathologists in this group are all breast cancer specialists and have extensive experience; third, the differences in years of practice may have little effect on the consistency of lymph node diagnosis in the results of our review of 160 randomly chosen FSs.
We used the RRCART algorithm to classify WSIs into high-accuracy and low-accuracy subgroups. The probability of a model recognition error in the former is similar to that of the expert review. The latter subgroup requires manual review to meet the diagnostic requirements for FSs. This candidate dual-diagnosis model combining a DNN and experts may reduce the number of required expert diagnoses. Furthermore, we set the absence of negative and macrometastasis prediction errors in the high-accuracy subset as the filtering threshold for the models and retained four qualified DNN models. This process can provide reliable evidence for the choice of operation mode in China. Hence, our approach may improve the diagnostic accuracy of the patch-DNN model.
There were some limitations in our study. First, we used a histopathologic review of lymph node metastasis from breast cancer in FS WSIs to evaluate the accuracy of the patch-DNN. A histopathologic review may not reflect real-world performance. However, this pilot research demonstrated the feasibility of the patch-DNN in histopathologic diagnosis with FSs. Second, we collected 2,362 FSs for this study without considering the clinical features of the patients. In some selected patients, the positive rate of FSs may be higher than in others 25. However, FS diagnosis is widely used in most hospitals in China without any selection, so our research may reflect the current situation in China. Additionally, the dataset was split at the WSI level. This means that WSIs from the same patient can appear in both the training and test sets, which may lead to overestimated classification metrics. However, RRCART enriches the high-accuracy WSIs, reducing the effect on the subsequent clinical workflow. Third, we found no isolated tumor cells in this dataset. However, isolated tumor cells are a rare condition in lymph node metastasis from breast cancer 26, which may be why no cases are found in some small samples. Finally, we compared the accuracy of the patch-DNN on WSIs with that of pathologists reading traditional glass slides. There are some differences between reading WSIs and reading glass slides. However, most pathologists are more accustomed to glass slides, and in our group, there were no pathologists who had been trained in reading WSIs. We intended to evaluate the degree of consistency among different pathologists; for this purpose, glass-slide reading is closer to the real-world situation for pathologists.

Conclusion
We employed RRCART and the Poisson distribution, using quantitative WSI data, to improve the prediction results of the patch-DNNs used for detecting lymph node metastasis from breast cancer. Based on the prediction results of the DNN and the features extracted at the WSI classification stage, we can directly run RRCART and distinguish high-accuracy WSIs from low-accuracy WSIs. In the histopathologic review of lymph node metastasis from breast cancer, our method can serve as an adaptor module connecting the patch-DNN pipeline and the clinical workflow, reducing the safety risk of using the patch-DNN directly by recommending the low-accuracy WSIs for pathologists to review and the high-accuracy ones for reports.

Images from human subjects. Our research was approved by the Ethics Committee of National Cancer Center/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College (20/209-2405). The detailed statement of the approval letter was provided in the Related files. Briefly, on July 6, 2020, the Ethics Committee received the application for approval of the clinical research plan "Computer recognition strategy for lymph node metastasis of breast cancer" and the application for exemption from informed consent. The application met the requirements for rapid review according to the ethics committee SOP, and the ethics committee discussed it and voted on July 8, 2020. Two members voted: 2 in favor and 0 against. The committee approved the study. In this part of the series, patient consent was not required, as no risk to the participants was anticipated. The authors confirm that all methods were carried out in accordance with relevant guidelines and regulations.
We collected FSs of sentinel lymph nodes from breast cancer patients from the Department of Pathology, Cancer Institute & Hospital, Chinese Academy of Medical Sciences (CICAMS), China, from January 2017 to November 2019. The 8-10 µm FSs were stained with hematoxylin and eosin. The final dataset consisted of 2362 patient-exclusive WSIs (129 micrometastases, 353 macrometastases and 1879 normal). All the information on the 499 patients and WSIs was deidentified. We scanned the glass slides containing these sections with a NanoZoomer S210 scanner at 40X equivalent magnification. All slides were scanned at a resolution of 0.243 μm per pixel.

Reference standard.
We followed the method of Bejnordi et al. 6 to perform annotation, with some small modifications. In brief, all metastatic lesions from positive FS WSIs were first annotated manually by 2 residents with 2-3 years of work experience using the open-source tool ASAP 1.8 (https://github.com/computationalpathologygroup/ASAP/releases), and then each annotated WSI was reviewed in detail by a breast subspecialist pathology review group composed of 3 pathology experts (attending doctor and above) with a mean working experience of 14.3 years (9-18 years). The review reports of the FSs by the pathology expert group were used as the ground truth. The type of lymph node metastasis was divided into three groups, micrometastasis, macrometastasis and negative, according to the American Joint Committee on Cancer (AJCC 8th) 30.
Dataset splitting. The dataset was divided into training, validation and test subsets at the WSI level with repeated random separation until no significance (P > 0.10) was detected between any two subsets with respect to all features (Table S5). The P value was calculated using a t-test for continuous features and Fisher's exact test for the categorical features. All the features were extracted from the tissue contours of both positive and negative WSIs and metastatic lesion contours from only positive WSIs.
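The repeated random separation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: it checks a single continuous feature, and it swaps in a permutation test on the difference in means as a dependency-free stand-in for the t-test (Fisher's exact test for categorical features is omitted). The names `permutation_p_value`, `balanced_split` and `tissue_area` are hypothetical.

```python
import itertools
import random
from statistics import mean

def permutation_p_value(a, b, n_perm=200, rng=None):
    """Two-sided permutation test on the difference in means.
    Assumption: used here as a stand-in for the t-test named in the text."""
    rng = rng or random.Random(0)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def balanced_split(features, sizes, alpha=0.10, max_tries=100, seed=0):
    """Reshuffle WSI indices until every pair of subsets has P > alpha."""
    rng = random.Random(seed)
    indices = list(range(len(features)))
    for _ in range(max_tries):
        rng.shuffle(indices)
        subsets, start = [], 0
        for size in sizes:
            subsets.append(indices[start:start + size])
            start += size
        values = [[features[i] for i in subset] for subset in subsets]
        pairs = itertools.combinations(range(len(values)), 2)
        if all(permutation_p_value(values[i], values[j], rng=rng) > alpha
               for i, j in pairs):
            return subsets
    raise RuntimeError("no balanced split found within max_tries")

# Hypothetical feature: one tissue-area value per WSI (synthetic data).
data_rng = random.Random(42)
tissue_area = [data_rng.gauss(100.0, 15.0) for _ in range(120)]
train, val, test = balanced_split(tissue_area, sizes=(60, 20, 40))
```

In practice the acceptance check would loop over all features in Table S5 and accept a split only when every feature passes for every pair of subsets.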
To explore the relationship between the number of annotated WSIs and the saturation of the classification metrics, including the sensitivity, specificity, precision, accuracy, F1 score and AUC, the 2362 FSs were randomly divided at the WSI level into training sets of increasing size (100, 300, 600, 1000 and 1500), with 202 FSs used for validation and 630 for testing (Table S6). Each training set was a subset of the next larger one. The whole process was randomly repeated 5 times. Our results showed that when the number of training WSIs reached 600, all six metrics began to saturate (Fig. S1A-F). Therefore, 600 training WSIs was the minimum number required to saturate the above classification evaluation metrics. Unless otherwise stated, 600 WSIs were used in the following model training (Table S6).
To study the influence of the sampling ratio and DNN on the classification metrics, the FS WSIs were randomly divided into training, validation and test subsets with sizes fixed at 600, 202 and 1560, respectively (Table S6). To validate the RRCART algorithm, the 1560-WSI test subset above was further split into a training set and a test set of 780 WSIs each.

Patch sampling. Patches for DNN training were generated by sampling from the training WSIs. First, because of the large size of the WSIs, we used Otsu's method 31 to efficiently discard all background and fat patches in each WSI. Second, to avoid sampling biases, we designed a random-sampling strategy as follows: (1) Each WSI, with or without metastatic cancer, was selected with equal probability. (2) For each region of interest in a WSI, we collected enough patches, with random coordinates generated from a uniform distribution, to form a candidate patch set for the WSI. (3) Fixed numbers of normal and tumor patches were randomly selected from the candidate patch set of each WSI. The fixed numbers follow directly from the tumor-to-normal patch ratio. Previous work has reported that increasing the normal-to-tumor patch ratio can reduce false positives 32. In our work, we tentatively set the ratio of tumor to normal patches to 1:3, 1:5, and 1:7.
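Steps (2) and (3) of the sampling strategy can be sketched as below. This is an illustrative sketch only: regions of interest are simplified to axis-aligned rectangles (the study used annotated contours), and the function names, region bounds and patch counts are hypothetical.

```python
import random

def candidate_coordinates(region, n, rng, patch=256):
    """Draw n top-left patch coordinates uniformly inside a rectangular
    region given as (x0, y0, x1, y1). Assumption: rectangular regions."""
    x0, y0, x1, y1 = region
    return [(rng.randint(x0, x1 - patch), rng.randint(y0, y1 - patch))
            for _ in range(n)]

def sample_patches(tumor_pool, normal_pool, n_tumor, normal_per_tumor, rng):
    """Select fixed numbers of tumor and normal patches for one WSI.
    normal_per_tumor = 3 corresponds to a tumor-to-normal ratio of 1:3."""
    n_normal = n_tumor * normal_per_tumor
    tumor = rng.sample(tumor_pool, min(n_tumor, len(tumor_pool)))
    normal = rng.sample(normal_pool, min(n_normal, len(normal_pool)))
    return tumor, normal

rng = random.Random(0)
# Hypothetical regions of interest in slide pixel coordinates.
tumor_pool = candidate_coordinates((0, 0, 4096, 4096), 500, rng)
normal_pool = candidate_coordinates((4096, 0, 20000, 20000), 2000, rng)
tumor, normal = sample_patches(tumor_pool, normal_pool,
                               n_tumor=100, normal_per_tumor=3, rng=rng)
```

Changing `normal_per_tumor` to 5 or 7 gives the other two ratios considered in the text.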
Patch-DNN method for frozen-section diagnosis. Our patch-DNN method consists of a patch-level classification and a WSI-level classification. At the patch-level classification stage, a random-sampling method was used to extract 256 × 256 patches from the WSI training set. The patch-based classifier was trained to estimate the class of each patch. Then, we partitioned each WSI into 256 × 256 patches without overlapping. The probability values of the patches, predicted by the classifier in the first stage, were embedded into a heatmap image.
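The heatmap construction step can be illustrated with a short sketch. It assumes a hypothetical `predict(x, y)` callable that stands in for the trained patch classifier and returns a tumor probability for the patch whose top-left corner is at (x, y); patches that do not fit fully inside the slide are skipped, which is also an assumption.

```python
def tile_origins(width, height, patch=256):
    """Top-left corners of non-overlapping patches that fit inside the slide."""
    return [(x, y)
            for y in range(0, height - patch + 1, patch)
            for x in range(0, width - patch + 1, patch)]

def build_heatmap(width, height, predict, patch=256):
    """Place each patch probability at its grid position (row, column)."""
    rows, cols = height // patch, width // patch
    heatmap = [[0.0] * cols for _ in range(rows)]
    for x, y in tile_origins(width, height, patch):
        heatmap[y // patch][x // patch] = predict(x, y)
    return heatmap

# Toy classifier: the right half of the slide is "tumor-like".
toy_predict = lambda x, y: 0.9 if x >= 512 else 0.1
heatmap = build_heatmap(1024, 512, toy_predict)
```

Each heatmap cell corresponds to one 256 × 256 patch, so a 1024 × 512 region yields a 2 × 4 grid.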
At the WSI-level classification stage, the corresponding heatmaps were used as the basic data for the following postprocessing method to discriminate the classes of WSIs. We extracted several features (Table S5) from heatmaps to train a WSI-based classification model, and a random forest was trained as the WSI-based classifier.
We combined three sampling ratios (1:3, 1:5, and 1:7) with three commonly used DNNs (InceptionV3, ResNet101 and Xception) and performed patch sampling and DNN training twice for each combination, constructing a total of 18 patch-DNN models to study the influence of the sampling ratio and DNN on the classification metrics. Considering the long-term cost (2 months for 18 models) and the small difference between the results of the two repeated experiments, we did not repeat the experiment further.
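The count of 18 models follows from crossing the three factors, as the small sketch below shows. The dictionary keys are illustrative names, not identifiers from the study.

```python
from itertools import product

ratios = ("1:3", "1:5", "1:7")                       # tumor-to-normal patch ratios
backbones = ("InceptionV3", "ResNet101", "Xception")  # DNN architectures
repeats = (1, 2)                                      # two independent runs

# One configuration per (ratio, backbone, repeat) combination.
configs = [{"ratio": r, "backbone": b, "repeat": k}
           for r, b, k in product(ratios, backbones, repeats)]
```

This full factorial layout (3 × 3 × 2) is what makes a two-way ANOVA with an interaction term applicable.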

DNN model training and validation.
The training workflow of the DNN model is as follows. We trained the DNNs with the stochastic gradient descent optimizer in TensorFlow (version 1.8.0) 28. The initial learning rate was 0.01, and the decay rate was 0.1. All models were initialized with weights pre-trained on ImageNet (https://github.com/tensorflow/), and cross entropy was used as the loss function to update the network parameters. To test the stability and generalization of the system framework, we separately selected InceptionV3, Xception71, and ResNet101 as our backbone CNN models, all of which were trained on patches of 256 × 256 pixels (~0.22 μm/pixel) at 40X magnification from the training WSIs. Then, based on the validation set, we completed the training of each DNN model within 75 epochs (approximately 72 h) without overfitting.

Classification metrics. Six general statistical classification metrics, including sensitivity, specificity, precision, accuracy, F1 score (Eq. 1) and AUC value, were used to estimate the performance of the classifiers. The misdiagnosis rate is equal to (1 - accuracy). The AUC value was obtained by using scikit-learn 29 (version 0.23).
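For reference, the five count-based metrics and the misdiagnosis rate can be computed from a confusion matrix as sketched below, using the standard definitions (F1 as the harmonic mean of precision and sensitivity). The counts in the example are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "misdiagnosis_rate": 1 - accuracy,
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
```

The AUC is not included because it depends on the ranked prediction scores rather than on a single confusion matrix.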

Incorrect diagnosis rate of FSs by pathology experts.
We hypothesized that no systematic errors (misdiagnoses made by most of the experts or the models) would be made by the expert group in the diagnosis of 160 FSs of sentinel lymph nodes (Fig. 1). We also hypothesized that the number of incorrect diagnoses (predictions) made by pathology experts or models on a given number of WSIs follows the Poisson distribution (Eq. (2)).
Incorrect diagnosis rates for pathology experts were scaled to per WSI per examination. The incorrect diagnosis rate (λ0) was calculated by adding up the number of incorrect diagnoses made by the 15 pathology experts on all 160 WSIs randomly selected from the test set and then dividing by the total number of diagnoses (160 × 15).
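The rate calculation above, and a Poisson test built on it, can be sketched as follows. Two points are assumptions rather than statements from the text: the P value is taken as the upper-tail probability P(X ≥ x) under a Poisson distribution with mean λ = n × λ0, and the error count of 12 is a hypothetical number chosen only to make the arithmetic concrete.

```python
import math

def expert_error_rate(n_errors, n_wsis=160, n_experts=15):
    """Incorrect diagnoses per WSI per examination (lambda_0)."""
    return n_errors / (n_wsis * n_experts)

def poisson_upper_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam), with lam > 0.
    Terms are computed in log space to avoid overflow for large x."""
    below = sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
                for k in range(x))
    return max(0.0, 1.0 - below)

lambda_0 = expert_error_rate(n_errors=12)   # hypothetical count
lam = 780 * lambda_0                        # expected errors on n = 780 WSIs
p_value = poisson_upper_tail(20, lam)       # 20 observed model errors (hypothetical)
```

A small P value here would indicate that a model made more errors on the n WSIs than would be expected from the expert error rate.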
The P value was calculated from the Poisson distribution, where x is the observed number of incorrect diagnoses (predictions) for a certain number of WSIs (n) and λ is calculated as n × λ0.

Relative risk classification and regression tree algorithm. We hypothesized that the prediction results of the patch-DNN models could be divided into two categories: high-accuracy WSIs, with a frequency of model recognition errors similar to that of pathology experts, and low-accuracy WSIs, with a significantly higher frequency of model recognition errors than that of experts. The probability p is defined as the ratio of the number of incorrectly predicted WSIs (S_wr_low) to the total number of predicted WSIs (S_low) with low accuracy; the probability q is defined as the ratio of the number of incorrectly predicted WSIs (S_wr_high) to the total number of predicted WSIs (S_high) with high accuracy. By this definition, S_wr_low + S_wr_high equals the number of hard WSIs, that is, WSIs whose prediction differs from the ground truth. We add 1 to the numerators of p and q so that neither is zero when a set contains no incorrectly predicted WSIs, which makes the calculation more robust. Their calculation methods are given in Eqs. (4), (5) and (6). In practice, the set S_high is always larger than S_low under our proposed classification model, so we designed the hyperparameter high_per given in Eq. (7).
We implemented the cost-sensitive learning process 33 by combining an RR index and a classification and regression tree (CART) 34. The RR index is the ratio of the incorrect diagnosis probabilities of the low- and high-accuracy WSI sets and is used as the loss function for training the RRCART; it expresses the risk of incorrect prediction for WSIs with low accuracy relative to that for WSIs with high accuracy.
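To make the RR index concrete, the sketch below computes it for a candidate split and scans thresholds on a single feature, as a CART-style node would. Several parts are assumptions: the RR is taken as p/q with 1 added to each numerator, WSIs with feature values at or below the threshold are sent to the high-accuracy set, and `min_high_fraction` is a hypothetical stand-in for the high_per hyperparameter, whose exact form is not reproduced here.

```python
def relative_risk(wrong_low, n_low, wrong_high, n_high):
    """RR = p / q, with 1 added to each numerator as described in the text."""
    p = (wrong_low + 1) / n_low
    q = (wrong_high + 1) / n_high
    return p / q

def best_threshold(feature, hard, min_high_fraction=0.8):
    """Scan thresholds on one heatmap feature and keep the split with the
    largest RR. `hard` holds 1 for incorrectly predicted WSIs, else 0."""
    best = None
    n = len(feature)
    for t in sorted(set(feature)):
        high = [h for f, h in zip(feature, hard) if f <= t]
        low = [h for f, h in zip(feature, hard) if f > t]
        if not low or len(high) / n < min_high_fraction:
            continue  # skip empty or too-small high-accuracy candidates
        rr = relative_risk(sum(low), len(low), sum(high), len(high))
        if best is None or rr > best[1]:
            best = (t, rr)
    return best

# Toy data: the two hard WSIs have the largest feature values.
feature = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
hard = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
split = best_threshold(feature, hard, min_high_fraction=0.5)
```

Maximizing RR pushes incorrectly predicted WSIs into the low-accuracy set while the size constraint keeps the high-accuracy set large, which matches the enrichment goal described above.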
Algorithm 1 describes the stop conditions for the recursive method, which is also used in a CART classification tree, and Algorithm 2 shows how to replace the Gini method with our relative risk method (Supplementary materials).